Objectives:

- Examine residuals
- Learn about leverage and influential observations
- Straighten scatterplots
Read in the data set Penguins.csv using read.csv() and store the results in penguins. Use the clean_names() function from the janitor package on penguins to get consistent variable names. From this point forward, we will make a habit of using clean_names() for consistent variable names in our data frames.
library(dplyr)   # %>% pipe and mutate()
library(ggplot2) # plotting
library(janitor) # consistent variable names
penguins <- read.csv("Penguins.csv") %>%
  clean_names()
knitr::kable(head(penguins))
| dive_heart_rate | depth_m | duration_min | bird |
|---|---|---|---|
| 88.8 | 5.0 | 1.050000 | EP19 |
| 103.4 | 9.0 | 1.183333 | EP19 |
| 97.4 | 22.0 | 1.916667 | EP19 |
| 85.3 | 25.5 | 3.466667 | EP19 |
| 60.6 | 30.5 | 7.083333 | EP19 |
| 77.6 | 32.5 | 4.766667 | EP19 |
Create a scatterplot of dive_heart_rate versus duration_min that resembles Figure 8.1 on page 240 of your textbook.
ggplot(data = penguins, aes(x = duration_min, y = dive_heart_rate)) +
geom_point(color = "brown") +
theme_bw() +
labs(x = "Duration (minutes)", y = "Dive heart rate (beats per minute)")
Note the shape of the relationship in the scatterplot.
Find the least squares line for regressing dive_heart_rate onto duration_min and store the results in mod_pen.
mod_pen <- lm(dive_heart_rate ~ duration_min, data = penguins)
summary(mod_pen)
Call:
lm(formula = dive_heart_rate ~ duration_min, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-30.358 -8.356 -2.933 10.770 43.022
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 96.902 2.601 37.26 <2e-16 ***
duration_min -5.468 0.311 -17.58 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.11 on 123 degrees of freedom
Multiple R-squared: 0.7153, Adjusted R-squared: 0.713
F-statistic: 309 on 1 and 123 DF, p-value: < 2.2e-16
Plot the residuals from mod_pen versus duration_min in a plot that resembles Figure 8.2 on page 240 of your textbook. Start by creating a new data frame (NDF) from mod_pen using the augment() function from the broom package.
library(broom)
NDF <- augment(mod_pen) %>%
clean_names()
knitr::kable(head(NDF))
| dive_heart_rate | duration_min | fitted | resid | hat | sigma | cooksd | std_resid |
|---|---|---|---|---|---|---|---|
| 88.8 | 1.050000 | 91.16056 | -2.360556 | 0.0270430 | 14.16888 | 0.0003996 | -0.1695718 |
| 103.4 | 1.183333 | 90.43149 | 12.968513 | 0.0262406 | 14.12050 | 0.0116840 | 0.9312164 |
| 97.4 | 1.916667 | 86.42160 | 10.978397 | 0.0221361 | 14.13485 | 0.0070043 | 0.7866580 |
| 85.3 | 3.466667 | 77.94617 | 7.353832 | 0.0151798 | 14.15465 | 0.0021248 | 0.5250752 |
| 60.6 | 7.083333 | 58.17015 | 2.429848 | 0.0080252 | 14.16882 | 0.0001209 | 0.1728681 |
| 77.6 | 4.766667 | 70.83774 | 6.762262 | 0.0111452 | 14.15716 | 0.0013084 | 0.4818501 |
ggplot(data = NDF, aes(x = duration_min, y = resid)) +
geom_point(color = "green4") +
theme_bw() +
geom_hline(yintercept = 0, lty = "dashed") +
labs(x = "Duration (min)", y = "Residuals")
Fix the model by transforming the response (take the log of dive_heart_rate) and create a scatterplot of the transformed response variable versus duration_min.
# Your Code Goes HERE
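One possible approach (a sketch; it assumes the penguins data frame loaded above and takes the log of the response directly inside aes()):

```r
library(ggplot2)

# Plot log(dive heart rate) against duration; the bend should straighten out
ggplot(data = penguins, aes(x = duration_min, y = log(dive_heart_rate))) +
  geom_point(color = "brown") +
  theme_bw() +
  labs(x = "Duration (minutes)", y = "log(Dive heart rate)")
```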
Find the least squares estimates using the transformed response and store the results in mod_pen2
penguins <- penguins %>%
mutate(log_dive_heart_rate = log(dive_heart_rate))
mod_pen2 <- lm(log_dive_heart_rate ~ duration_min, data = penguins)
summary(mod_pen2)
Call:
lm(formula = log_dive_heart_rate ~ duration_min, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-0.47142 -0.13589 -0.01389 0.16653 0.44953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.628122 0.040181 115.18 <2e-16 ***
duration_min -0.093604 0.004805 -19.48 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.218 on 123 degrees of freedom
Multiple R-squared: 0.7552, Adjusted R-squared: 0.7532
F-statistic: 379.5 on 1 and 123 DF, p-value: < 2.2e-16
Plot the residuals of mod_pen2 versus duration_min.
# Your Code Goes HERE
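A sketch of one way to do this, mirroring the earlier residual plot (NDF2 is an illustrative name, not one from the lab):

```r
library(broom)
library(dplyr)
library(ggplot2)
library(janitor)

# Augment the transformed model, then clean the column names as before
NDF2 <- augment(mod_pen2) %>%
  clean_names()

ggplot(data = NDF2, aes(x = duration_min, y = resid)) +
  geom_point(color = "green4") +
  theme_bw() +
  geom_hline(yintercept = 0, lty = "dashed") +
  labs(x = "Duration (min)", y = "Residuals")
```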
Use mod_pen2 to predict the heart rate of a penguin for a dive of 10 minutes.
predict(mod_pen2, newdata = data.frame(duration_min = 10)) -> ans
ans
1
3.692084
exp(ans)
1
40.12837
Note: mod_pen2 models the log of dive_heart_rate (by default, log() in R is the natural log, ln), so the prediction ans is on the log scale; use exp(ans) to convert the answer back to the original units (beats per minute).
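The same back-transform pattern on simulated data (the variable names here are illustrative, not from the lab data set):

```r
# Simulate data whose response is linear on the log scale
set.seed(42)
duration <- runif(100, 0.5, 12)  # dive durations in minutes
heart_rate <- exp(4.6 - 0.09 * duration + rnorm(100, sd = 0.2))

# Fit the model on the log scale
fit <- lm(log(heart_rate) ~ duration)

# Predict on the log scale, then exponentiate to return to original units
log_pred <- predict(fit, newdata = data.frame(duration = 10))
exp(log_pred)  # predicted heart rate in beats per minute
```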
Read in the data set Cereals, store the result in cereals, and regress calories onto sugars, storing the model in an object named mod_cer.
# Your Code Goes HERE
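A sketch of one approach; the file name Cereals.csv is an assumption based on the naming pattern of the other data sets:

```r
library(dplyr)
library(janitor)

cereals <- read.csv("Cereals.csv") %>%  # file name assumed
  clean_names()

mod_cer <- lm(calories ~ sugars, data = cereals)
summary(mod_cer)
```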
Create a histogram of the residuals from mod_cer that resembles Figure 8.3 on page 241 of your textbook.
# Your Code Goes HERE
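One way to draw the histogram (a sketch; the binwidth and colors are guesses to be tuned against Figure 8.3):

```r
library(ggplot2)

# Pull the residuals straight from the model into a small data frame
ggplot(data = data.frame(resid = residuals(mod_cer)), aes(x = resid)) +
  geom_histogram(binwidth = 10, color = "black", fill = "skyblue") +
  theme_bw() +
  labs(x = "Residuals", y = "Count")
```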
Use mutate() to create groups similar to those in Figure 8.4 on page 241 of your textbook.
NDF4 <- cereals %>%
mutate(resid = residuals(mod_cer), fitted = fitted(mod_cer)) %>%
mutate(groups = ifelse(resid <= -25, "low", ifelse(resid <= 25, "med", "high")))
NDF4 %>%
filter(groups == "low") -> lowresid
NDF4 %>%
filter(groups == "high") -> highresid
knitr::kable(lowresid)
| name | mfr | calories | sugars | carbo | protein | fat | sodium | fiber | potass | shelf | middle | shelf_1 | shelf_2 | shelf_3 | resid | fitted | groups |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100%_Bran | N | 70 | 6 | 5 | 4 | 1 | 130 | 10 | 280 | 3 | No | 0 | 0 | 1 | -34.55946 | 104.55946 | low |
| All-Bran | K | 70 | 5 | 7 | 4 | 1 | 260 | 9 | 320 | 3 | No | 0 | 0 | 1 | -32.07444 | 102.07444 | low |
| All-Bran_with_Extra_Fiber | K | 50 | 0 | 8 | 4 | 0 | 140 | 14 | 330 | 3 | No | 0 | 0 | 1 | -39.64935 | 89.64935 | low |
| Golden_Crisp | P | 100 | 15 | 11 | 2 | 0 | 45 | 0 | 40 | 1 | No | 1 | 0 | 0 | -26.92463 | 126.92463 | low |
| Puffed_Rice | Q | 50 | 0 | 13 | 1 | 0 | 0 | 0 | 15 | 3 | No | 0 | 0 | 1 | -39.64935 | 89.64935 | low |
| Puffed_Wheat | Q | 50 | 0 | 10 | 2 | 0 | 0 | 1 | 50 | 3 | No | 0 | 0 | 1 | -39.64935 | 89.64935 | low |
knitr::kable(highresid)
| name | mfr | calories | sugars | carbo | protein | fat | sodium | fiber | potass | shelf | middle | shelf_1 | shelf_2 | shelf_3 | resid | fitted | groups |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Just_Right_Fruit_&_Nut | K | 140 | 9 | 20 | 3 | 1 | 170 | 2 | 95 | 3 | No | 0 | 0 | 1 | 27.98548 | 112.0145 | high |
| Muesli_Raisins,Dates,&_Almonds | R | 150 | 11 | 16 | 4 | 3 | 95 | 3 | 170 | 3 | No | 0 | 0 | 1 | 33.01544 | 116.9846 | high |
| Muesli_Raisins,Peaches,&_Pecans | R | 150 | 11 | 16 | 4 | 3 | 150 | 3 | 170 | 3 | No | 0 | 0 | 1 | 33.01544 | 116.9846 | high |
| Mueslix_Crispy_Blend | K | 160 | 13 | 17 | 3 | 2 | 150 | 3 | 160 | 3 | No | 0 | 0 | 1 | 38.04541 | 121.9546 | high |
| Nutri-Grain_Almond-Raisin | K | 140 | 7 | 21 | 3 | 2 | 220 | 3 | 130 | 3 | No | 0 | 0 | 1 | 32.95552 | 107.0445 | high |
ggplot(data = NDF4, aes(x = fitted, y = resid, color = groups)) +
geom_point()+
theme_bw()
Create a scatterplot of calories versus sugars color coded by groups. Add the least squares lines for the three color-coded groups to the scatterplot.
# Your Code Goes HERE
ggplot(data = NDF4, aes(x = sugars, y = calories, color = groups)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_bw()
Explain the interaction and additive models below in class.
mod_gro <- lm(calories ~ sugars*groups, data = NDF4)
summary(mod_gro)
Call:
lm(formula = calories ~ sugars * groups, data = NDF4)
Residuals:
Min 1Q Median 3Q Max
-17.6518 -4.8504 -0.5769 5.1496 22.2832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 112.6923 19.5002 5.779 1.85e-07 ***
sugars 3.4615 1.8747 1.846 0.06899 .
groupslow -62.1923 20.0090 -3.108 0.00271 **
groupsmed -15.0405 19.6033 -0.767 0.44548
sugars:groupslow -0.1154 1.9840 -0.058 0.95379
sugars:groupsmed -2.0283 1.8909 -1.073 0.28704
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.55 on 71 degrees of freedom
Multiple R-squared: 0.8201, Adjusted R-squared: 0.8074
F-statistic: 64.74 on 5 and 71 DF, p-value: < 2.2e-16
# Note groups high is the base
mod_gro$coefficients
(Intercept) sugars groupslow groupsmed
112.6923077 3.4615385 -62.1923077 -15.0404602
sugars:groupslow sugars:groupsmed
-0.1153846 -2.0283261
# LSE for high group
b0h <- mod_gro$coefficients[1]
b1h <- mod_gro$coefficients[2]
c(b0h, b1h)
(Intercept) sugars
112.692308 3.461538
# LSE for med group
b0m <- mod_gro$coefficients[1] + mod_gro$coefficients[4]
b1m <- mod_gro$coefficients[2] + mod_gro$coefficients[6]
c(b0m, b1m)
(Intercept) sugars
97.651847 1.433212
# LSE for low group
b0l <- mod_gro$coefficients[1] + mod_gro$coefficients[3]
b1l <- mod_gro$coefficients[2] + mod_gro$coefficients[5]
c(b0l, b1l)
(Intercept) sugars
50.500000 3.346154
ggplot(data = NDF4, aes(x = sugars, y = calories, color = groups)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_bw() +
geom_abline(slope = b1h, intercept = b0h, size = 0.25) +
geom_abline(slope = b1m, intercept = b0m, size = 0.25) +
geom_abline(slope = b1l, intercept = b0l, size = 0.25)
Load the moderndive package to use the geom_parallel_slopes() function to graph parallel lines. Create a scatterplot of calories versus sugars color coded by groups. Add parallel lines to the scatterplot.
library(moderndive)
ggplot(data = NDF4, aes(x = sugars, y = calories, color = groups)) +
geom_point() +
geom_parallel_slopes(se = FALSE) +
theme_bw()
Explain the code below and how to get the intercepts and slopes for an additive model.
# Parallel slope model
mod_ps <- lm(calories ~ sugars + groups, data = NDF4)
summary(mod_ps)
Call:
lm(formula = calories ~ sugars + groups, data = NDF4)
Residuals:
Min 1Q Median 3Q Max
-16.0031 -6.2125 -0.8984 7.1906 20.5938
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 130.644 4.676 27.941 < 2e-16 ***
sugars 1.702 0.239 7.118 6.28e-10 ***
groupslow -73.017 5.581 -13.083 < 2e-16 ***
groupsmed -34.850 4.211 -8.275 4.28e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.921 on 73 degrees of freedom
Multiple R-squared: 0.7986, Adjusted R-squared: 0.7904
F-statistic: 96.51 on 3 and 73 DF, p-value: < 2.2e-16
# Note groups high is the base
mod_ps$coefficients
(Intercept) sugars groupslow groupsmed
130.643917 1.701577 -73.017416 -34.850289
# LSE for high group
b0h <- mod_ps$coefficients[1]
b1h <- mod_ps$coefficients[2]
c(b0h, b1h)
(Intercept) sugars
130.643917 1.701577
# LSE for med group
b0m <- mod_ps$coefficients[1] + mod_ps$coefficients[4]
b1m <- mod_ps$coefficients[2]
c(b0m, b1m)
(Intercept) sugars
95.793627 1.701577
# LSE for low group
b0l <- mod_ps$coefficients[1] + mod_ps$coefficients[3]
b1l <- mod_ps$coefficients[2]
c(b0l, b1l)
(Intercept) sugars
57.626501 1.701577
ggplot(data = NDF4, aes(x = sugars, y = calories, color = groups)) +
geom_point() +
geom_parallel_slopes(se = FALSE) +
theme_bw() +
geom_abline(slope = b1h, intercept = b0h, size = 0.25) +
geom_abline(slope = b1m, intercept = b0m, size = 0.25) +
geom_abline(slope = b1l, intercept = b0l, size = 0.25)
Recreate Figure 8.5 from page 242 of the text.
cereals <- cereals %>%
mutate(shelf = ifelse(shelf == 1, "bottom", ifelse(shelf == 2, "middle", "top")))
ggplot(data = cereals, aes(x = sugars, y = calories, color = shelf)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_bw() +
labs(x = "Sugar (g)", y = "Calories")
Read in the Marriage_age_2017.csv data set and reproduce the analysis in Example 8.1. Hint: use mutate(time = ifelse(year <= 1940, "FP", ifelse(year <= 1979, "MP", "TP"))) to create a new variable time with three time periods.
ma2017 <- read.csv("Marriage_age_2017.csv") %>%
clean_names()
ma2017 <- ma2017 %>%
mutate(time = ifelse(year <= 1940, "FP", ifelse(year <= 1979, "MP", "TP")))
# Your Code Goes HERE
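A sketch of the plot for Example 8.1; the column name age is an assumption (check names(ma2017) for the actual variable holding median age at first marriage):

```r
library(ggplot2)

# 'age' below is an assumed column name, not confirmed from the data set
ggplot(data = ma2017, aes(x = year, y = age, color = time)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # one line per time period
  theme_bw() +
  labs(x = "Year", y = "Median age at first marriage (yr)")
```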
Read in the Election_2000.csv data and recreate Figure 8.9 from page 246 of your text.
election2000 <- read.csv("Election_2000.csv") %>%
clean_names()
p1 <- ggplot(data = election2000, aes(x = nader, y = buchanan, color = county)) +
geom_point() +
theme_bw() +
labs(x = "Nader (votes)", y = "Buchanan (votes)") +
guides(color = "none") # remove legend (color = FALSE is deprecated)
p1
library(plotly) # create interactive graph
p2 <- ggplotly(p1)
p2
election2000_PB <- election2000 %>%
filter(county != "PALM BEACH")
dim(election2000_PB)
[1] 66 5
mod_no_pb <- lm(buchanan ~ nader, data = election2000_PB)
dim(election2000)
[1] 67 5
ggplot(data = election2000, aes(x = nader, y = buchanan)) +
geom_point() +
theme_bw() +
labs(x = "Nader (votes)", y = "Buchanan (votes)") +
geom_smooth(method = "lm", se = FALSE) +
geom_abline(intercept = coefficients(mod_no_pb)[1], slope = coefficients(mod_no_pb)[2])
Read in the Doctors_and_life_expectancy.csv data set and recreate Figures 8.13 and 8.14 from page 249 of your text.
dale <- read.csv("Doctors_and_life_expectancy.csv") %>%
clean_names()
names(dale)
[1] "country" "life_exp" "sqrt_tv_person"
[4] "sqrt_doctors_person"
ggplot(data = dale, aes( x = sqrt_doctors_person, y = life_exp)) +
geom_point(color = "brown") +
theme_bw() +
labs(x = expression(sqrt(Doctors/person)), y = "Life Expectancy (yr)", title = "Figure 8.13")
ggplot(data = dale, aes( x = sqrt_tv_person, y = life_exp)) +
geom_point(color = "blue") +
theme_bw() +
labs(x = expression(sqrt(TVs/person)), y = "Life Expectancy (yr)", title = "Figure 8.14")
Comment: What should we do to increase life expectancy?
Read in the Dirt_bikes_2014.csv data set and store the result in db2014.
# Your Code Goes HERE
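Following the same read-and-clean pattern as the other data sets (a sketch):

```r
library(dplyr)
library(janitor)

db2014 <- read.csv("Dirt_bikes_2014.csv") %>%
  clean_names()
```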
Create a scatterplot of msrp versus displacement.
# Your Code Goes HERE
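A sketch of the scatterplot; the axis units (cc and dollars) are assumptions:

```r
library(ggplot2)

ggplot(data = db2014, aes(x = displacement, y = msrp)) +
  geom_point(color = "blue") +
  theme_bw() +
  labs(x = "Displacement (cc)", y = "MSRP ($)")
```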
Regress msrp onto displacement and store the result in mod_db.
mod_db <- lm(msrp ~ displacement, data = db2014)
summary(mod_db)
Call:
lm(formula = msrp ~ displacement, data = db2014)
Residuals:
Min 1Q Median 3Q Max
-2520.8 -1134.2 213.6 1026.1 2350.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2856.8362 250.2051 11.42 <2e-16 ***
displacement 15.2567 0.9348 16.32 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1335 on 112 degrees of freedom
Multiple R-squared: 0.704, Adjusted R-squared: 0.7013
F-statistic: 266.4 on 1 and 112 DF, p-value: < 2.2e-16
Create a scatterplot of the residuals for mod_db versus the fitted values using the values in NDF5.
NDF5 <- augment(mod_db) %>%
clean_names()
# Your Code Goes HERE
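A sketch of the residuals-versus-fitted plot, using the cleaned augment() columns as in the earlier NDF:

```r
library(ggplot2)

ggplot(data = NDF5, aes(x = fitted, y = resid)) +
  geom_point(color = "green4") +
  theme_bw() +
  geom_hline(yintercept = 0, lty = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
```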
Add the variable displacement_third to the db2014 data frame by raising displacement to the 1/3 power.
db2014 <- db2014 %>%
mutate(displacement_third = displacement^(1/3))
Create a new data frame NDF6 using the filter() function that contains only liquid- or air-cooled engines (rows where air_cooled is "Liquid" or "Air").
NDF6 <- db2014 %>%
filter(air_cooled == "Liquid" | air_cooled == "Air")
Create a graph similar to the second graph on page 251 of your text using the data in NDF6.
# Your Code Goes HERE
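One guess at that graph (a sketch; it assumes the figure plots msrp against the cube root of displacement, color coded by cooling type):

```r
library(ggplot2)

ggplot(data = NDF6, aes(x = displacement_third, y = msrp, color = air_cooled)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(x = expression(Displacement^{1/3}), y = "MSRP ($)")
```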
Modify the code from the previous graph to create a scatterplot with parallel lines for the different engine types.
# Your Code Goes HERE
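A sketch that swaps geom_smooth() for geom_parallel_slopes(), as in the cereal example above:

```r
library(ggplot2)
library(moderndive)

ggplot(data = NDF6, aes(x = displacement_third, y = msrp, color = air_cooled)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) +  # one common slope, different intercepts
  theme_bw() +
  labs(x = expression(Displacement^{1/3}), y = "MSRP ($)")
```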
Read in the Hand_dexterity.csv data set and create a scatterplot of time_sec versus age_yr that resembles Figure 8.20 on page 254 of your textbook.
hd <- read.csv("Hand_dexterity.csv") %>%
clean_names() %>%
mutate(dominant = ifelse(dominant == 0, "Nondominant hand", "Dominant hand"))
# Your Code Goes HERE
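A sketch (the axis labels are assumptions based on the variable names):

```r
library(ggplot2)

ggplot(data = hd, aes(x = age_yr, y = time_sec, color = dominant)) +
  geom_point() +
  theme_bw() +
  labs(x = "Age (yr)", y = "Time (sec)")
```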
Create a scatterplot of speed versus age_yr that resembles Figure 8.21 on page 254 of your textbook.
# Your Code Goes HERE
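A sketch that assumes speed is the reciprocal of time_sec (if the data set already contains a speed column, use it directly instead):

```r
library(dplyr)
library(ggplot2)

hd <- hd %>%
  mutate(speed = 1 / time_sec)  # speed = 1/time is an assumption

ggplot(data = hd, aes(x = age_yr, y = speed, color = dominant)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(x = "Age (yr)", y = "Speed (1/sec)")
```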